Abstract

Background: Protein synthetic lethal genetic interactions are useful to define functional relationships between 
proteins and pathways. However, the molecular mechanism of synthetic lethal genetic interactions remains unclear. 

Results: In this study we used the clusters of short polypeptide sequences, which are typically shorter than the 
classically defined protein domains, to characterize the functionalities of proteins. We developed a framework to 
identify significant short polypeptide clusters from yeast protein sequences, and then used these short polypeptide 
clusters as features to predict yeast synthetic lethal genetic interactions. The short polypeptide cluster-based
approach provides much higher coverage for predicting yeast synthetic lethal genetic interactions. Evaluation using
experimental data sets showed that the short polypeptide cluster-based approach outperforms the previous
protein domain-based one.

Conclusion: We were able to achieve higher performance in yeast synthetic lethal genetic interactions prediction 
using short polypeptide clusters as features. Our study suggests that the short polypeptide cluster may help better 
understand the functionalities of proteins. 



Background 

Defining the functional relationships between proteins is 
essential to understand many aspects of biology. A classical
approach to understanding functional relationships between
genes is to examine the phenotype of a mutant that combines
mutations in two genes [1]; such relationships are called
genetic interactions. Recently, high throughput methods
[2-4] have been developed to generate large scale 
genetic interactions in model organisms, such as yeast 
[5], Schizosaccharomyces pombe [6] and E. coli [7]. The
large scale genetic interactions have attracted much 
attention as they are capable of defining the genome- 
wide functional relationships among proteins and are 
fundamental to comprehensive understanding of the 
organization of biological systems [5,8,9]. However, even 
with high throughput methods [2-4], experimental 
mapping of genetic interactions is still extremely labor
intensive, and it is not yet possible to screen genome-wide
combinations in multicellular organisms with tens of
thousands of genes [10]. Thus, it is imperative to develop
computational approaches to predict genome-wide 
genetic interactions and help complement and enhance 
wet-lab studies. 

In extreme cases, mutation of two nonessential genes
can lead to a lethal phenotype; this kind of genetic
interaction is called a synthetic lethal genetic interaction
(SLGI). Figure 1 illustrates one such synthetic lethal
genetic interaction. SLGIs are of interest because
they are able to reveal functional relationships between
proteins, pathways and complexes [11-13]. Two synthetic
lethal genes are likely to occur in compensatory pathways
[14] or compensatory complexes [15]. Furthermore, SLGIs
are potentially useful in finding drug targets or drug
combinations [16].

Figure 1 Illustration of synthetic lethal genetic interaction A) and ... knocking out both genes.

Prediction of SLGIs is impeded by our limited understanding
of genetic interactions. Unlike protein-protein
interactions, which are physical dockings between proteins,
the molecular mechanism underlying genetic interactions
is not fully understood. Thus, it is difficult to select
features and to understand how features are
related to SLGIs. Several computational approaches
have been proposed for prediction of SLGIs, and many 
features, such as protein interactions, gene expression, 
functional annotation, gene location, protein network 
characteristics, and genetic phenotype, have been used 
by these approaches [10,17-20]. However, those methods 
depend on other genome-wide experimental results. 

It is known as a virtual axiom in biology that the 
"sequence specifies structure and structure determines 
functionality" [21]. We hypothesize that it is possible to 
predict the SLGIs using the characteristics of protein 
sequence alone. Recently, we demonstrated that the 
yeast synthetic lethal genetic interactions can be 
explained by the genetic interactions between domains 
of those proteins [22]. Representing the structures and 
function of proteins, protein domains are usually 
regarded as building blocks of proteins and are conserved
during evolution. Our studies showed that domain genetic
interactions are a new type of relationship between protein
domains. Moreover, we found that different domains in
multi-domain yeast proteins contribute to their genetic
interactions differently. Domain genetic interactions help
define more precisely the function related to the synthetic
lethal genetic interactions, and in turn help explain how
domains contribute to the different functionalities of
multi-domain proteins. Using the probabilities of domain
genetic interactions, we were able to predict novel yeast
synthetic lethal genetic interactions.

However, the feasibility of domain based prediction is 
limited by the coverage of protein domains. For exam- 
ple, only 4480 of more than 6700 yeast proteins contain 
PfamA domains. In this study, we used the short poly- 
peptide sequences, which are typically shorter than the 
classically defined protein domains, to characterize the 
functionalities of proteins. We demonstrated that the 
genetic interaction between a pair of proteins can be 
determined by the genetic interactions between the 
short polypeptide clusters of those proteins. Using short 
polypeptide clusters as features, we can not only 
increase the prediction coverage, but also improve the 
prediction performance. 

Results 

Identifying significant short polypeptide sequence pairs 

We constructed the short polypeptide clusters in three 
steps. First, we identified significant short polypeptide 
sequence pairs from yeast proteins based on the similarities
of local alignments. For each yeast protein A, we
chopped its protein sequence into short polypeptides
of length L, advancing a moving window along the sequence
by w residues at a time. Then, we used the Smith-Waterman
algorithm [23] to search, for each short polypeptide of
protein A, the sequences of all other yeast proteins for
local matches with similarity scores beyond a predefined
threshold (Figure 2). We used BLOSUM62 to score the
similarity. A significant local match between a short
polypeptide sequence a_i from protein A and a short
polypeptide sequence b_j from protein B indicates a
polypeptide sequence pair a_i and b_j. If, in addition, the
reciprocal search using b_j from protein B as the query
finds a_i from protein A as a significant local match, the
pair a_i and b_j is identified as a significant polypeptide
sequence pair.
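
The chopping step can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that w is the step by which the window advances, that trailing residues which do not fill a complete window are dropped, and that proteins shorter than L yield no fragments. The function name `chop_sequence` and the toy sequence are hypothetical.

```python
def chop_sequence(sequence, length=25, step=5):
    """Return (start, fragment) pairs for fixed-length windows taken every `step` residues.

    Assumption: windows that would run past the end of the sequence are skipped.
    """
    if len(sequence) < length:
        return []
    last_start = len(sequence) - length
    return [(start, sequence[start:start + length])
            for start in range(0, last_start + 1, step)]


# Hypothetical 40-residue sequence, used only for illustration.
toy_protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRV"
fragments = chop_sequence(toy_protein)  # windows start at residues 0, 5, 10 and 15
```

Each fragment would then serve as a query for the local alignment search, and the start offsets make it possible to map a significant match back to its position in the parent protein.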

The significance of local matches between short
polypeptide sequences is determined using a p-value
threshold. We compared each short polypeptide to all
other short polypeptide sequences and obtained a series
of similarity scores. We then modeled those scores using
an extreme value distribution. Based on a predefined
p-value threshold, we determined the similarity score
threshold for local matching. A local match is significant
if its score exceeds the similarity score threshold.
Because the score distribution is fitted separately for
each short polypeptide sequence, the same p-value yields
a different similarity score threshold for each sequence.

We considered the following parameters when experimenting
with our method: 1) size L of the short polypeptide
sequence; 2) moving window size w; 3) penalty for gap
and mismatch in the alignment; 4) p-value for the
threshold of similarity score. In this study, we set the
size of each short polypeptide sequence L to 25 and
the window w to 5. The penalty for gap and
mismatch was set to 14. We used a p-value of 10^-6 as
the threshold for similarities. We
eventually obtained 3,353,962 short polypeptide
sequence pairs covering 6711 yeast proteins. These pairs
involve a total of 357,256 unique polypeptide sequences.



Clustering short polypeptide sequences 

After identifying significant polypeptide sequence pairs, 
we developed a clustering algorithm to group similar 
short polypeptide sequences into clusters. Initially, each 
significant polypeptide sequence pair was considered as 
a polypeptide sequence cluster. We first aligned the
significant polypeptide sequence pair using ClustalW [24].
Then, we built a hidden Markov model (HMM) using
the output of the multiple sequence alignment as the
seed. The HMM was constructed with the
HMMbuild tool from HMMER [25]. After that, we
searched all 357,256 short polypeptide sequences with the
HMM for similar short polypeptide sequences.
HMMsearch from HMMER [25] was used
to screen for similar polypeptide sequences whose
significance passed a threshold. The similar short
polypeptide sequences were added to the cluster. The
above process was then repeated until no new short
polypeptide sequence was added.

Several stringent thresholds (10^-10, 10^-15 and 10^-20)
for HMMsearch were tested in order to include all similar
short polypeptide sequences and reduce false cluster
members. Due to the large size of the short polypeptide 
sequence pairs, we first obtained a cluster using each 
short polypeptide sequence pair as the seed. Then, we 
post-processed the short polypeptide sequence clusters. 
This strategy allowed us to easily run the clustering 
algorithm on a computer cluster. 

Post-processing short polypeptide clusters 

Our goal is to use the short polypeptide clusters to
represent the functionalities of proteins, as protein
domains were used in [26]. First, we removed duplicate
short polypeptide clusters. Second, we merged pairs of
short polypeptide clusters, testing three alternative
merging criteria: the two clusters share 1) at least one
polypeptide sequence; 2) at least 10% of the polypeptide
sequences of the smaller cluster; or 3) at least 20% of
the polypeptide sequences of the smaller cluster.
Although we used a loose merging criterion, the
stringent thresholds used by HMMsearch still allowed
the short polypeptides in each cluster to have high
similarity. Figure 3 shows the multiple sequence alignment
of short polypeptides in a merged cluster with 14 short
polypeptides. We observed that the short polypeptide
sequences in this cluster are highly conserved, with a
significant number of identical amino acids.
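
The merging criteria can be illustrated with a small sketch. It is not the authors' implementation: it assumes that clusters are sets of sequence identifiers and that merging is repeated until no pair of clusters satisfies the criterion, which the text does not state. The function name `merge_clusters` and the toy data are hypothetical.

```python
def merge_clusters(clusters, min_fraction=None):
    """Merge clusters (sets of sequence ids) whose overlap meets the criterion.

    min_fraction=None merges on at least one shared member; otherwise the shared
    members must make up at least `min_fraction` of the smaller cluster.
    Assumption: merging is repeated until no further pair qualifies.
    """
    clusters = [set(c) for c in clusters]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                shared = len(clusters[i] & clusters[j])
                smaller = min(len(clusters[i]), len(clusters[j]))
                qualifies = shared >= 1 and (
                    min_fraction is None or shared >= min_fraction * smaller)
                if qualifies:
                    clusters[i] |= clusters[j]
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break  # restart the scan, since cluster contents changed
    return clusters


# Hypothetical clusters of sequence ids.
toy_clusters = [
    {1, 2, 3},
    {3, 4},               # shares 1 of its 2 members with the first cluster
    set(range(10, 20)),   # 10 members
    set(range(19, 30)),   # 11 members, shares only id 19 with the previous cluster
]
merged_any_overlap = merge_clusters(toy_clusters)
merged_20_percent = merge_clusters(toy_clusters, min_fraction=0.2)
```

In this toy case the one-overlap criterion leaves two clusters, whereas the 20% criterion leaves three, because a single shared sequence is below 20% of the ten-member cluster. The pairwise scan is quadratic and is meant only to make the criteria concrete.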

Then, we filtered out the short polypeptide clusters
with only two polypeptide sequences, as those clusters
have no power to predict SLGIs. We also filtered out
short polypeptide clusters that occur in a large number
of proteins. Such common short polypeptide clusters
occur in both positive and negative data
and likewise provide no prediction power.

Next, we investigated how the choice of thresholds for
HMMsearch and for filtering clusters affects the short
polypeptide clusters. We tested different threshold
configurations that combined one of three E-value
thresholds for HMMsearch (10^-10, 10^-15 and 10^-20) with one of
four thresholds for filtering out short polypeptide
clusters (20, 50, 100 and No Filter). Tables 1, 2 and 3 list the
number of retained short polypeptide clusters, the
number of proteins and the number of SLGIs covered
by the short polypeptide clusters, respectively. In the three
tables, the first column lists the thresholds for filtering
out short polypeptide clusters; the first row lists the
thresholds for cluster merging and the second row lists the
E-value thresholds used by HMMsearch. The results
showed that increasing the cluster merge threshold
increases the number of covered proteins and the number of
short polypeptide clusters. Meanwhile, reducing the
E-value threshold from 10^-10 to 10^-15 also increases the
number of covered proteins and the number of short
polypeptide clusters. However, further reducing the E-value
threshold to 10^-20 did not change the number of covered
proteins or the number of short polypeptide clusters.

The results showed that the short polypeptide clusters
covered more proteins. For example, 5073 proteins are
covered by the short polypeptide clusters obtained using
10^-15 as HMMsearch threshold, 10% as cluster merge
threshold and 50 as polypeptide cluster filtering
threshold, compared with 4480 proteins covered by PfamA
domains. The maximum number of polypeptide clusters
contained in a single protein is 54 under this parameter
configuration. The results indicate that the coverage of short
polypeptide clusters in yeast proteins is higher than that
of PfamA domains. The results also showed that the
short polypeptide clusters covered a similar number of
SLGIs. For example, compared with 7702 SLGIs covered
by PfamA domains, our polypeptide clusters cover 7681
SLGIs.

Predicting yeast synthetic lethal genetic interactions 
using short polypeptide clusters by maximum likelihood 
estimation (MLE) approach 

In order to demonstrate the superiority of using short 
polypeptide clusters to predict SLGIs, we first obtained 
the probabilities of genetic interactions between short 
polypeptide clusters, and then used them to predict the 
probabilities of yeast SLGIs. We assumed that the
genetic interactions between different pairs of short
polypeptide clusters are independent and applied the
maximum likelihood estimation (MLE) approach to estimate the
probabilities of genetic interactions between short
polypeptide clusters.

We compared the MLE methods based on short polypeptide
clusters, obtained using different HMMsearch,
cluster filtering and cluster merge thresholds, to the
MLE method based on protein domains. We trained
those MLE methods using all SLGIs covered by features.
The MLE methods based on short polypeptide clusters were able
to assign probabilities of genetic interaction to more
protein pairs. For example, the MLE method based on
short polypeptide clusters using 10^-15 as HMMsearch
threshold, 10% as cluster merge threshold and 50 as
polypeptide cluster filtering threshold was able to
assign probabilities of genetic interaction to
1,060,860 protein pairs, while the MLE method based on
protein domains could only assign probabilities of genetic
interaction to 536,175 protein pairs. This result showed that
the short polypeptide cluster-based approach provides
much higher coverage for predicting SLGIs.

Table 1 The number of short polypeptide clusters obtained with various thresholds for HMMsearch, cluster filtering and cluster merge.

To further evaluate the performance of the short
polypeptide cluster-based MLE method, we tested the MLE
methods on an experimentally obtained set of genetic
interactions and non-interactions comprising 3771
SLGIs and 688,045 non-SLGIs. These data were
used by Wong et al. [10], so we refer to them
as the Wong data. We first trained the MLE
method using SLGIs that are not included in the Wong
data. Then, we assigned probabilities of genetic
interaction to the SLGIs and non-SLGIs in the Wong data.
Table 4 lists the AUC (area under the ROC curve) values
for predicting the Wong data with MLE methods based on
short polypeptide clusters obtained using different
HMMsearch, cluster merge and cluster filtering
thresholds. The performance of the MLE method based on short
polypeptide clusters is slightly better than that of the MLE method
based on protein domains. The AUC score for the MLE method
based on short polypeptide clusters using 10^-15 as
HMMsearch threshold, 10% as cluster merge threshold
and 50 as polypeptide cluster filtering threshold is
0.6761, while the AUC score for the MLE method based on protein
domains is 0.6567.
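
For readers unfamiliar with the metric, the AUC equals the probability that a randomly chosen SLGI receives a higher predicted probability than a randomly chosen non-SLGI, with ties counted as one half. The sketch below computes it by direct pairwise comparison; the function name `auc_from_scores` and the scores are hypothetical and are not taken from the study.

```python
def auc_from_scores(positive_scores, negative_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a correct pair.
    """
    correct = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                correct += 1.0
            elif pos == neg:
                correct += 0.5
    return correct / (len(positive_scores) * len(negative_scores))


# Hypothetical predicted probabilities for SLGI and non-SLGI protein pairs.
toy_slgi_scores = [0.9, 0.6, 0.4]
toy_non_slgi_scores = [0.5, 0.4, 0.1, 0.1]
toy_auc = auc_from_scores(toy_slgi_scores, toy_non_slgi_scores)  # 10.5 / 12 = 0.875
```

The pairwise loop is quadratic in the number of pairs, so a data set of the size of the Wong data would call for a rank-based computation instead; the definition of the quantity is the same.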

Discussion and conclusions 

In this study, we developed a framework to identify sig- 
nificant short polypeptide clusters from yeast protein 
sequences. We hypothesized that those short polypep- 
tide clusters represent the functionalities of proteins, 
like the protein domains. We then used these short 
polypeptide clusters as features to predict yeast synthetic 
lethal genetic interactions. The short polypeptide cluster 
based approach provides much higher coverage for pre- 
dicting yeast synthetic lethal genetic interactions. Eva- 
luation using experimental data sets showed that the 
short polypeptide cluster based approach can achieve 
higher performance than the previous protein domain 
based approach. 

In the future, we would like to continue improving the
identification of short polypeptide clusters. Moreover, it is
worthwhile to develop methods to understand those
short polypeptide clusters. Annotating those short
polypeptide clusters may help better understand the
functionalities of protein domains.

Methods 

Source of data 

We downloaded the yeast synthetic lethal genetic
interactions from the Saccharomyces Genome Database
(SGD) [27] (February 2011 version). There were a total of
11,011 synthetic lethal genetic interactions. We
downloaded the protein sequences of yeast from GenBank
[28]. There are 6717 proteins with sequences in total.
The minimum and maximum lengths of the protein
sequences in yeast are 16 and 4901, respectively. The
average length is 450 and the standard deviation of the
protein sequence lengths is 380.

Table 2 The number of proteins covered with various thresholds for HMMsearch, cluster filtering, and cluster merge.

Table 3 The number of SLGIs retained with various thresholds for HMMsearch, cluster filtering, and cluster merge.

Threshold   Merge with at least one overlap   Merge with at least 10% overlap   Merge with at least 20% overlap
            10^-10  10^-15  10^-20            10^-10  10^-15  10^-20            10^-10  10^-15  10^-20

Determination of local alignment similarity score 
threshold 

The distribution of scores of local alignments between a
short polypeptide and all other short polypeptide
sequences can be described by an extreme value
distribution (EVD):

$$F(x) = \exp\left(-e^{-(x-\mu)/\beta}\right) \qquad (1)$$

and the parameters of the EVD can be estimated by:

$$\mu = \bar{x} - 0.577\,\beta \qquad (2)$$

$$\beta = \frac{\sqrt{6}}{\pi}\,\sigma \qquad (3)$$

where $\bar{x}$ and $\sigma$ are the sample mean and standard
deviation, respectively.

Based on Karlin-Altschul statistics [29], the expected
number of high-scoring segment pairs (HSPs) with
score higher than S can be obtained by:

$$E = K\,m\,n\,e^{-\lambda S} \qquad (4)$$

where m and n are the lengths of the two sequences
being compared. The parameters $K$ and $\lambda$ can be
obtained from the location $\mu$ and scale $\beta$ of the EVD:

$$K = \frac{e^{\mu/\beta}}{m\,n} \qquad (5)$$

$$\lambda = \frac{1}{\beta} \qquad (6)$$

The p-value of finding at least one HSP with score
higher than S can be obtained by [29]:

$$P = 1 - e^{-E} \qquad (7)$$

With a given p-value, we can get the corresponding
E-value from equation (7). The parameters $K$ and $\lambda$ can be
estimated from the sample mean and standard deviation of
the scores. The significant similarity score can then be
computed by solving equation (4) for S.
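
As a worked illustration of this procedure, the sketch below fits the EVD by the method of moments and inverts the p-value to a score threshold. It assumes the Gumbel form of the EVD with location mu and scale beta, for which K m n = exp(mu / beta) and lambda = 1 / beta, so that equation (4) reduces to E = exp(-(S - mu) / beta). The function names and the score list are hypothetical.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649  # the constant rounded to 0.577 in the text


def fit_evd(scores):
    """Method-of-moments estimates of the EVD location (mu) and scale (beta)."""
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)  # sample standard deviation
    beta = sd * math.sqrt(6.0) / math.pi
    mu = mean - EULER_GAMMA * beta
    return mu, beta


def score_threshold(scores, p_value):
    """Similarity score S whose p-value under the fitted EVD equals `p_value`."""
    mu, beta = fit_evd(scores)
    e_value = -math.log1p(-p_value)       # invert P = 1 - exp(-E)
    return mu - beta * math.log(e_value)  # invert E = exp(-(S - mu) / beta)


# Hypothetical alignment scores of one short polypeptide against other sequences.
toy_scores = [22, 25, 19, 30, 27, 24, 21, 35, 26, 23]
threshold = score_threshold(toy_scores, p_value=1e-6)
```

Because the fit uses only the scores obtained for one query polypeptide, a different query generally gives a different threshold for the same p-value.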

Algorithm to cluster short polypeptide sequences 

The short polypeptide sequence clustering method
is summarized as follows:

Input: a pair of short polypeptide sequences

Initialization: add the short polypeptide sequence pair
to the cluster

Step 1. Conduct multiple sequence alignment (MSA)
for the sequences in the cluster using ClustalW;

Step 2. Build an HMM using HMMbuild from
the output of the MSA in step 1;

Step 3. Search for all similar short polypeptide sequences
using HMMsearch and add them to the cluster. If no
new short polypeptide sequence is added, stop. Otherwise, go
back to step 1.
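
The control flow of the steps above can be sketched as a fixed-point loop. The sketch deliberately does not call ClustalW or HMMER: the argument `find_similar` is a hypothetical stand-in for the whole alignment, HMMbuild and HMMsearch chain, and the toy similarity rule (Hamming distance of at most one) is an assumption made only so that the example runs.

```python
def grow_cluster(seed_pair, all_sequences, find_similar):
    """Expand a seed pair until the search step returns no new members.

    find_similar(cluster, all_sequences) is a placeholder for steps 1 to 3
    (align, build the model, search) and must return an iterable of hits.
    """
    cluster = set(seed_pair)
    while True:
        hits = set(find_similar(cluster, all_sequences))
        new_members = hits - cluster
        if not new_members:
            return cluster
        cluster |= new_members


def toy_find_similar(cluster, all_sequences):
    """Toy stand-in: a hit differs from some cluster member at no more than one position."""
    def distance(x, y):
        return sum(a != b for a, b in zip(x, y))
    return [s for s in all_sequences
            if any(distance(s, member) <= 1 for member in cluster)]


toy_sequences = ["AAAA", "AAAT", "AATT", "ATTT", "GGGG", "GGGC"]
toy_cluster = grow_cluster(("AAAA", "AAAT"), toy_sequences, toy_find_similar)
```

Here the cluster grows in two rounds (first "AATT", then "ATTT") and stops in the third, which mirrors the stopping rule in step 3. Since each seed pair is processed independently, the loop can be run for many seeds in parallel.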

Estimation of probabilities and significances of domain 
genetic interactions 

We treated the protein SLGIs L_{m,n} and the short
polypeptide cluster genetic interactions C_{i,j} as random
variables: L_{m,n} = 1 if two proteins m and n genetically
interact and L_{m,n} = 0 otherwise; C_{i,j} = 1 if two short
polypeptide clusters i and j genetically interact and
C_{i,j} = 0 otherwise. We estimated the probabilities of
potential short polypeptide cluster interactions
Pr(C_{i,j} = 1) by maximizing the likelihood of the observed
genetic interactions using the Expectation-Maximization
(EM) algorithm [30-32]. The EM algorithm iteratively
estimates the maximum likelihood of the 'complete data'
that combine the observed data and the unobserved data.
Here, the protein genetic interactions and the short
polypeptide cluster information of proteins are our
observed data, and the short polypeptide cluster genetic
interactions are our unobserved data.

Table 4 The AUC values for the predictions of Wong data using different short polypeptide clusters obtained with
various thresholds for HMMsearch, cluster filters, and cluster merges.

Threshold   Merge with at least one overlap   Merge with at least 10% overlap   Merge with at least 20% overlap

Assuming short polypeptide cluster genetic interactions
are independent, the likelihood of the observed protein
genetic interactions based on short polypeptide cluster
genetic interactions can be obtained as:

$$L = \prod_{i,j} \Pr(C_{i,j}=1)^{M_{i,j}+a}\,\bigl(1-\Pr(C_{i,j}=1)\bigr)^{N_{i,j}+K_{i,j}+b} \qquad (8)$$

where M_{i,j} is the number of genetically interacting pairs
between short polypeptide clusters i and j in all protein
genetic interactions; N_{i,j} is the number of non-interacting
short polypeptide cluster pairs between i and
j in protein genetic interactions; and K_{i,j} is the number
of non-interacting protein pairs that include i in
one protein and j in the other. The value of K_{i,j} is
computed by counting all possible protein pairs with i
in one protein and j in the other, excluding the
known genetically interacting protein pairs. K_{i,j}
remains unchanged during the EM computation. The
constants a and b are pseudo counts that prevent Pr(C_{i,j} = 1)
or Pr(C_{i,j} = 0) from being zero when instances of clusters i
and j are rare. We set both a and b to 1 in our
calculation.

Initially, M_{i,j} was set to the number of genetic
interactions between clusters i and j in the experimental genetic
interactions, and N_{i,j} was set to 0. Pr(C_{i,j} = 1) was
initialized as follows:

$$\Pr(C_{i,j}=1) = \frac{M_{i,j}}{M_{i,j}+N_{i,j}+K_{i,j}} \qquad (9)$$

In each Expectation step of the EM algorithm, we first
estimated the expected values E[M_{i,j}] and E[N_{i,j}] [31]
using the current Pr(C_{i,j} = 1):

$$E[M_{i,j}] = \sum_{(m,n)} \frac{\Pr(C_{i,j}=1)}{1-\prod_{k \in C(m),\, l \in C(n)} \bigl(1-\Pr(C_{k,l}=1)\bigr)} \qquad (10)$$

$$E[N_{i,j}] = \sum_{(m,n)} \left(1 - \frac{\Pr(C_{i,j}=1)}{1-\prod_{k \in C(m),\, l \in C(n)} \bigl(1-\Pr(C_{k,l}=1)\bigr)}\right) \qquad (11)$$

where the sums run over the genetically interacting protein
pairs (m, n) in which cluster i occurs in one protein and
cluster j in the other, and C(m) denotes the set of short
polypeptide clusters contained in protein m.

Then, we calculated Pr(C_{i,j} = 1) using E[M_{i,j}] and
E[N_{i,j}] as follows (Maximization step):

$$\Pr(C_{i,j}=1) = \frac{E[M_{i,j}]+a}{E[M_{i,j}]+E[N_{i,j}]+K_{i,j}+a+b} \qquad (12)$$

The EM algorithm iterated over the Expectation and
Maximization steps until the change in the likelihood L was less
than a pre-defined small value.

We assumed that two proteins genetically interact
(L_{m,n} = 1) if and only if at least one short polypeptide
cluster pair from the two proteins genetically interacts
(C_{i,j} = 1). Then, we calculated the probability of two
proteins genetically interacting, Pr(L_{m,n} = 1), as follows:

$$\Pr(L_{m,n}=1) = 1 - \prod_{i \in C(m),\, j \in C(n)} \bigl(1-\Pr(C_{i,j}=1)\bigr) \qquad (13)$$

A pair of proteins was predicted to be an SLGI only if its
probability is higher than a predefined threshold.
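
The EM updates in equations (9) to (13) can be sketched compactly. This is an illustration under stated assumptions rather than the authors' code: protein pairs are given directly as pairs of cluster sets, the non-interacting pairs used for K_{i,j} are supplied as a list instead of being enumerated genome-wide, cluster pairs are treated as unordered, and a fixed number of iterations replaces the likelihood-based stopping rule. All function and variable names are hypothetical.

```python
from itertools import product


def cluster_pairs(clusters_m, clusters_n):
    """Unordered cluster pairs (i, j) with i from one protein and j from the other."""
    return {tuple(sorted(pair)) for pair in product(clusters_m, clusters_n)}


def protein_probability(pairs, prob):
    """Equation (13): one minus the probability that no cluster pair interacts."""
    none_interact = 1.0
    for pair in pairs:
        none_interact *= 1.0 - prob[pair]
    return 1.0 - none_interact


def em_cluster_interactions(positives, negatives, a=1.0, b=1.0, iterations=50):
    """Estimate Pr(C_ij = 1) from interacting and non-interacting protein pairs."""
    positive_pairs = [cluster_pairs(m, n) for m, n in positives]
    k_count, m_count = {}, {}
    for m, n in negatives:                      # K_ij, fixed during EM
        for pair in cluster_pairs(m, n):
            k_count[pair] = k_count.get(pair, 0) + 1
    for pairs in positive_pairs:                # initial M_ij
        for pair in pairs:
            m_count[pair] = m_count.get(pair, 0) + 1
    all_pairs = set(k_count) | set(m_count)
    # Equation (9) with N_ij = 0.
    prob = {p: m_count.get(p, 0) / (m_count.get(p, 0) + k_count.get(p, 0))
            for p in all_pairs}
    for _ in range(iterations):
        exp_m = dict.fromkeys(all_pairs, 0.0)
        exp_n = dict.fromkeys(all_pairs, 0.0)
        for pairs in positive_pairs:            # E-step, equations (10) and (11)
            p_protein = protein_probability(pairs, prob)
            for pair in pairs:
                share = prob[pair] / p_protein
                exp_m[pair] += share
                exp_n[pair] += 1.0 - share
        # M-step, equation (12).
        prob = {p: (exp_m[p] + a) /
                   (exp_m[p] + exp_n[p] + k_count.get(p, 0) + a + b)
                for p in all_pairs}
    return prob


# Hypothetical proteins described only by the clusters they contain.
toy_positives = [({"c1"}, {"c2"}), ({"c1", "c3"}, {"c2"})]
toy_negatives = [({"c3"}, {"c2"}), ({"c3"}, {"c2"}), ({"c3"}, {"c4"})]
toy_prob = em_cluster_interactions(toy_positives, toy_negatives)
toy_prediction = protein_probability(cluster_pairs({"c1"}, {"c2"}), toy_prob)
```

In this toy data the cluster pair (c1, c2) appears only in interacting protein pairs and ends with a higher probability than (c2, c3), which also appears in non-interacting pairs. The pseudo counts keep every estimate strictly between 0 and 1.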